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Abstract —We consider approximate maximum likelihood pa¬ 
rameter estimation in nonlinear state-space models. We discuss 
both direct optimization of the likelihood and expectation- 
maximization (EM). For EM, we also give closed-form ex¬ 
pressions for the maximization step in a class of models that 
are linear in parameters and have additive noise. To obtain 
approximations to the filtering and smoothing distributions 
needed in the likelihood-maximization methods, we focus on 
using Gaussian filtering and smoothing algorithms that employ 
sigma-points to approximate the required integrals. We discuss 
different sigma-point schemes based on the third, fifth, seventh, 
and ninth order unscented transforms and the Gauss-Hermite 
quadrature ruie. We compare the performance of the methods 
in two simuiated experiments: a univariate nonlinear growth 
model as well as tracking of a maneuvering target. In the 
experiments, we also compare against approximate likelihood 
estimates obtained by particle filtering and extended Kalman 
filtering based methods. The experiments suggest that the higher- 
order unscented transforms may in some cases provide more 
accurate estimates. 

I. Introduction 

T HIS paper is an extended version of our article |Q]] where 
we considered parameter estimation in state-space models 
using expectation-maximization (EM) algorithms based on 
sigma-point and particle smoothers. In this paper, we extend 
our interest from EM algorithms to so called direct maximum 
likelihood based parameter estimation methods, where instead 
of using the EM algorithm, the marginal likelihood of the 
parameters is directly approximated using nonlinear filtering 
methods. In particular, we focus our interest to sigma-point 
filters which use high-order unscented Kalman filters and 
Gauss-Hermite Kalman filters to approximate the likelihood 
surface. 

We consider state-space models of the following form: 

x fc = f(x fc —1,0) + qfc_i, 

y fc = h(x fe ,0) + r k , 

where x/- G K" is the discrete-time state sequence with an 
initial distribution xo ~ N(xo | mo(0), Po(<?)), y k G is 
the measurement sequence, q~ N(0, Q (0)) is the Gaussian 
process noise sequence, ~ N(O,R(0)) is the Gaussian 
measurement error sequence, and 6 G M r " is a static parameter 
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vector. Typically, one is interested in computing the posterior 
distribution of the state x/ r given measurements up to time 
k, p(xfc | yi,...,yfc), known as the filtering problem, or 
computing the posterior distribution of the state x/, : given all 
measurements, p(x^ | yi,..., y:r)> where k < T, known 
as the smoothing problem. In the general case, analytical 
expressions do not exist and we have to resort to approximative 
algorithms such as the sigma-point methods. See, for example, 
yD for a general overview of Bayesian filtering and smoothing. 

While many filtering and smoothing algorithms are formu¬ 
lated assuming fixed static parameters 6, in practice optimal 
values for these parameters are generally unknown. Therefore, 
methods for estimating the parameters from the data are 
desired. In this paper, we concentrate on maximum-likelihood 
methods, where the parameters are selected by maximizing 
the marginal likelihood, or equivalently the logarithm of the 
marginal likelihood, that is 

0ml = argmax 0 logp(yi:T | 0). (2) 

In linear systems with additive Gaussian noise, the likeli¬ 
hood can be evaluated using the Kalman filter y|,l4|]- Many 
optimization algorithms utilize also the gradient of the log- 
likelihood. The gradient can be evaluated by so-called sensi¬ 
tivity equations, a recursion that is obtained by differentiating 
the Kalman filter recursion @]. Alternatively, due to Fisher’s 
identity, the gradient may be evaluated by differentiating an 
auxiliary function that can be computed during the smoothing 
pass HEzb- Instead of directly optimizing the likelihood, the 
expectation-maximization (EM) algorithm [[§|] can be used to 
optimize parameters. The EM algorithm consists of iterating 
the expectation (E) step where a bound of the log-likelihood 
is computed using the current parameter estimates, and the 
maximization (M) step where the bound is maximized with 
respect to the parameters. The evaluation of the bound in the 
E-step is obtained by solving the smoothing problem. See |3l 
for a discussion of applying the EM algorithm in state-space 
models. Note that in the linear-Gaussian case both gradient 
evaluation methods as well as the EM algorithm in principle 
converge to the same solution, namely, the parameter value 
that maximizes the log-likelihood. 

In this paper, our interest lies in estimating the static 
parameters by maximum-likelihood estimation in the case of 
nonlinear state-space models with additive Gaussian noise, 
that is model Q. Formally, the marginal likelihood can be 
computed by marginalizing out the states from the joint distri¬ 
bution of the measurements and states using nonlinear filtering 
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equations and the prediction error decomposition (see, e.g., 
yl 10fl), leading to similar methods as in the linear-Gaussian 
case. However, since the state variables x cannot in general 
be marginalized out analytically, one needs to employ approx¬ 
imative methods. In the so called direct likelihood methods, 
the likelihood is approximated directly using approximative 
nonlinear filtering methods (see, e.g., [i| fiol- 131) and its 
maximum is found via nonlinear optimization. Similarly, the 
expectation-maximization (EM) algorithm can be employed, 
but the E-step cannot be solved exactly. Instead, the E-step is 
approximated with nonlinear smoothing algorithms (see, e.g., 

Cl Ml). 

The aim of this paper is to extend the results of our paper H 
by showing how high-order (i.e., third, fifth, seventh, and ninth 
order) unscented transforms and Gauss-Hermite integration 
based sigma-point methods can be used for approximate direct 
likelihood and EM-based parameter estimation in nonlinear 
state-space models. For EM, we also give closed-form ex¬ 
pressions for the maximization step in a class of models that 
are linear in parameters and have additive noise. We compare 
the unscented transform and Gauss-Hermite based sigma- 
point methods to linearization-based extended Kalman filter 
algorithms and Monte Carlo based particle filtering algorithms. 
We also provide an algorithm for computing the gradients re¬ 
quired by the gradient-based optimization methods. Although 
we focus on maximum likelihood estimation, the provided 
algorithms can be easily extended to computation of maximum 
a posteriori estimates by including a prior distribution to the 
objective function. 


II. Sigma-Point Filtering and Smoothing 

Under our interpretation, sigma-point filtering and smooth¬ 
ing is derived by assuming Gaussian approximations for the 
state distributions, which enables the use of a Kalman filter 
like filtering recursion and a Rauch-Tung-Striebel backward 
pass for the smoothing distributions. The Gaussian filtering 
and smoothing equations contain expectations over Gaussian 
distributions which cannot be generally evaluated in closed 
form. The sigma-points arise from approximating these Gaus¬ 
sian integrals by weighted sums determined by some cubature 
(multi-dimensional quadrature) formula. Hence, we interpret 
the different sigma-point methods as incarnations of different 
integral approximations. 

In the following, we first present the assumed Gaussian 
density filtering and smoothing framework. Then, we dis¬ 
cuss various different cubature rules for approximating the 
Gaussian integrals. Finally, we show how the cubature rules 
are applied to the assumed Gaussian density filtering and 
smoothing framework to obtain the filtering and smoothing 
equations explicitly in the sigma-point form. 


A. General Gaussian Filtering and Smoothing 

Assumed density Gaussian filtering (see 00111) is based 
on assuming that the filtering distributions are approximately 
Gaussian, that is, assuming means m fc | fe and covariances Pfe| fc 
such that 


p(x k | yi:fe) ~ N(x fc I m fc | fe , Pfc|fc) 


(3) 


as well as means m fe | fe+1 and covariances P fc |fe +1 such that 


p(x k+ i I yi:fc) ~ N(x fc+ 1 I m fe | fc+1 ,P A .| fc+1 ). (4) 


The filtering equations of the resulting Gaussian filter [18 


consist of a prediction step and an update step. In the predic¬ 
tion step, we compute the state mean and covariance of the 
distribution p(x& | yi : fc_i) using the Gaussian approximation 
for p(xfc_ 1 | yi : fc_i). The resulting equations are 


m k]k _ 1 = E[f(x fc _i)], 

P fc |fe_i = E[(f(x fe _i) - mfcifc.i) (5) 

X (f(Xfe_i) - Hlfclfc.!) 1 "] + Q, 

where the expectations are taken with respect to the distribu¬ 
tion x fc _i ~ N(m fc _ 1 | fc _ 1 ,P fc _ 1 | fc _ 1 ). 

In the corresponding update step, we assume a Gaussian 
density p(x fc | yi :k -i) = N(x fc | m fc | fe _ 1 , Pfe| fc _i) and 
compute the state mean and covariance for the distribution 
p(x k | yi : fc). The resulting equations are 


p k = E[h(x fe )], 

S fc = E[(h(x fc ) - /life) (h(xfe) - p k ) T ] + R, 
Cfe = E[(xfe - m fc |fc_ 1 ) (h(xfe) - /ifc) T ], 

Kfe = Cfe SjT 1 , 

m k\k — tiife|fe_i T Kfe (yfc /ifc); 

Pfc|fc = Pfe|fe—1 Kfe Sfe Kfe , 


( 6 ) 


where the expectations are taken with respect to the distribu¬ 
tion Xfe ~ N(mfe|fc_ 1 ,Pfc|fc_ 1 ). 

The smoothing distributions p(x k | y 1 -t) are obtained from 
a backward pass, that is, starting from k = T and iterating 
backwards in time. On each step, the smoothing density of 
Xfe + i is assumed to be Gaussian: p(xfc + i | yi-.r) = N(xfc + i | 
mfe+riT) Pfc+i|r). The mean and covariance for p(x k \ yi-.r) 
are then computed from the previous Gaussian smoothing 
density and the Gaussian filtering densities using the Rauch- 
Tung-Striebel backward pass 121. 22 1 as follows 1191: 


m fc+ i|fe = E[f(xfe)], 

Pfe+i|fc = E[(f(xfc) - m fc+1 |fe) 

x (f(xfe) - mfe +1 |fc) T ] + Q, 

Dfe + i = E[(xfc - m fc |fe) (f(x fc ) - mfc +1 |fe) T ], (7) 

Gfe — Dfe + i [Pfe_j_i|fe] , 

m fc|T = m fc|fe + Gfe (mfe +1 | T — mfe + 1 |fe), 

Pfc|T = Pfc|fc + Gfe (Pfc+r|T ~ Pfc+l|fc) Gfe, 

where the expectations are taken with respect to the distri¬ 
bution Xfe ~ N(rrife|fe,Pfc|fe). The pairwise joint smoothing 
distributions p(xfe,Xfe_ 1 | yi : T) are also of interest since 
they are used in the expectation-maximization algorithm (see 
Section UlI-Bb . Gaussian approximations for these distributions 
are obtained as a by-product of the smoothing backward pass 
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results as follows (see, e.g., 0 P- 189]). 


p(x fc ,Xfc_ 1 I yi:r) « 


N 


( 


f x fc 


f m k\T \ 

V m fc-:i|T/ 


Pfc|T 

Gfc_iPfe|T Pfc —1|T JJ 

( 8 ) 


B. Approximating the Gaussian Integrals 

As we saw in the previous section, during the evaluation 
of the prediction and update steps of the Gaussian filter and 
smoother, we need to solve a set of Gaussian integrals on each 
step. These integrals are of the following form: 


E [g( x )] = / g( x )N( 

J R n 


x | m, P) dx, 


(9) 


where g : R™ —> R d is the integrand and the weighting func¬ 
tion N(x | m, P) is a multi-dimensional Gaussian density with 
mean m and covariance matrix P. In this paper, these integrals 
are computed by using multi-dimensional generalizations of 
Gaussian quadratures—also referred to as Gaussian cubatures 
1 231] . They give approximations of the form 


E[g(x)] « (10) 

i 


where the weights w, and sigma-points x, are functions of the 
mean m and covariance P of the Gaussian weighting function. 
The sigma-points are positioned as follows: 


Xi = m + L 6, 


( 11 ) 


where are method specific unit sigma-points, and L is 
a matrix square-root factor such that P = LL T (e.g., the 
Cholesky decomposition of P). The differences in the methods 
come from different choices of weights and unit sigma-points. 

In the following we briefly introduce a number of schemes 
for choosing the weights and sigma-points. The difference 
between these schemes stems from a trade-off between the 
number of sigma-points (required function evaluations) versus 
precision in the approximation. The degree of approximation 
is quantified by the highest polynomial order, p, for which the 
method is exact. 


Unscented transform. The unscented transform (UT, [24, 


25]) uses a set of 2n + 1 cubature points located in the 


center and on the surface of an n-sphere. The radius and 
the weights can be controlled using a set of parameters. 
The cubature points are given by: 


6 = 0 , 

_ f s/X + net, i = l,...,n, 

\ -yf\ Tne,_ n , i = n + 1,... ,2n, 

where e, denotes a unit vector to the direction of coor¬ 
dinate axis i, and the weights are defined as follows: 


(0) _ / n+A’ 


for mean terms, 

l i + 

— a 2 + (I), 

for covariance terms 

,(i) - 1 


,2 n, 

2 (n + A) ’ 


where A = a 2 (n + n) — n and a, ft, and n are parameters 
of the method. 

Symmetric, 3 rd order. A widely applicable sigma-point 
scheme is constructed by setting the unscented transform 
parameters to a = ±1, /3 = 0, and n = 0 [|20]. This is 
also known as the 3 rd order symmetric spherical-radial 
cubature method (CKF, |2<§]; see |[Z7] for the explicit 
connection). This method utilizes a scaled and rotated set 
of 2 n points, which are selected to be at the intersections 
of an n-sphere and the coordinate axes: 

t. f y/nei , i = l,2 ,...,n, 

( yfrt Gj— ji , 1 — Tl 4- 1, . . . , 2 Tl. 


The weights are defined as w, = 1/(2 n) for i = 
1,2,..., 2 n. The number of evaluation points is a lin¬ 
ear function of the state dimension. The corresponding 
sigma-point filter is referred to as UKF 3. 

Symmetric, 5 th order. Building upon the work of McNamee 
and Stenger [ 281], it is possible to find explicit fully 
symmetric integration formulas of higher order than 
three. These integration schemes are exact for symmetric 
polynomials up to a given order p. For order p = 5, 
the number of required sigma-points is 2 n 2 + 1. The 
corresponding sigma-point filter is referred to as UKF 5. 

Symmetric, 7 th order. For order p = 7, the number of 
required sigma-points is |(4 n 3 + 8n + 3), meaning that 
they scale cubicly with the number of state dimensions. 
The corresponding sigma-point filter is referred to as 
UKF 7. 


Symmetric, 9 th order. For order p = 9, the number of 
required sigma-points is ^(2n 4 —4?z 3 +22n 2 —8n+3). The 
corresponding sigma-point filter is referred to as UKF 9. 
If required, even higher order methods can be constructed 
in the spirit of |28ll. 

Gauss-Hermite. The n-dimensional Gauss-Hermite quadra¬ 
ture method forms the sigma-points as a Cartesian 
product of the one-dimensional Gauss-Hermite quadra¬ 
tures, and the weights are simply products of the one¬ 
dimensional weights 1 18 . a I 23 I 1 . The disadvantage of 
this method is that with a pth order GH approximation 
(exact for polynomials up to order p), the required 
number of evaluation points is p n , the number growing 
exponentially with state dimension n. The corresponding 
filter is referred to as GHKF. 


The exact formulas for the higher-order methods become 
lengthy and have been omitted here for brevity (see, i2C& Eii 
[ 29 ], for implementation details and discussion). Figure Q] gives 
a pictorial example of how the points and weights are placed 
in two dimensions (n = 2) for each of the methods. Note 
that even though the 5 th and 9 th order methods do not have 
negative weights when n = 2, they have negative weights in 
other dimensions. 

For higher state dimensions. Figure [2] shows how the 
number of required points scale in each of the schemes. The 
exponentially growing number of evaluation points for Gauss- 
Hermite is apparent in Figure [2] In the UKFs, the number of 
evaluation points grow polynomially. McNamee and Stenger 
provide the following bound for the number of evaluation 
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• • • 


• • 

• • 


-4 0 4 

(a) Symmetric (order 3) 


-4 0 4 

(b) Symmetric (order 5) 


-4 0 4 

(c) Symmetric (order 7) 


-4 0 4 

(d) Symmetric (order 9) 


-4 0 4 

(e) Gauss-Hermite 


Fig. 1. Unit sigma-points in two dimensions for each of the methods. The absolute value of the weights are indicated by the point size, positive weights 
being black, negative weights white, (a) The symmetric cubature rule of order p = 3 with 4 points, (b) The symmetric cubature rule of order p = 5 with 
9 points, (c) The symmetric cubature rule of order p = 7 with 17 points, (d) The symmetric cubature rule of order p = 9 with 25 points, (e) For comparison, 
the Gauss-Hermite (order p = 9) sigma-points (81 points, many of which with very small weights) are also shown. 



Fig. 2. Scaling of the number of sigma-points for each of the symmetric 
methods (solid lines). The required number of sigma-points for the Gauss- 
Hermite cubature of corresponding order p is visualized by the dashed lines. 


points for the fully symmetric integration formulas of arbitrary 
degree p = 2k + 1 in n-space: 0((2 n) k /k\). Note that while 
in this paper we focus on the higher-order methods based on 
McNamee and Stenger, alternative cubature rules have also 
been suggested (see pIEI). 


C. Sigma-Point Filtering and Smoothing 

The following sigma-point filtering and smoothing equa¬ 
tions are obtained by selecting a cubature rule, for example, 
one of the rules discussed in Section urn and substituting 
it in place of the expectations in the Gaussian filtering and 
smoothing equations (Sec. III-AI Eqs. [5j[7]i. 

In the following equations, we denote the lower triangular 
matrix square-root (Cholesky) factor of a covariance matrix 
P by L so that for example Pfc|fc_i = Lfclfc-iUlfc-r The 
prediction step is 

^ tvi f (m fc _ 1 | fe _ 1 L fc _ 1 | fc _ 1 &) , 

i 

Pfc|fc—1 “ ^ ^ (f (mfc—l|fc —1 T* Lfc— l\k— 1 t*l/c|fc_l) 

i 

X (f ( m A;-l|fc-l + Lfc-l|fc-l £i) — m k \ k _i) | + Q 

( 12 ) 


and the update step is 

Fk = ^2 Wi h (rn.k\k-i + Lfc|fc-i £;,) , 

i 

Sfc = ^ (h (nifcifc.! + L fc | fc _ 1 &) - /x fc ) 

i 

x (h (m fc | fc _i + Lfc|fc_i &) - fj, k ) T | + R, 

= ^ ^ i $,i (1^) 

i 

x (h (nifcifc.! + Lfclfc.! ii) - Hk) T j, 

K k = C k S~\ 

= nifcifc.! + K fe (y k - fi k ), 

Pfc|fc = Pfe|fe-i - K fc S fc K.J. 

The Rauch-Tung-Striebel smoother equations are 

^ ^ f ( YH-k\k T ^k\k •> 

i 

P k+l\k = ^2 (^ ( m fc| fc + Us|fc &) ” m fc+l|fc) 

i 

X (f (m fe | fc + L fc | fc £i) - m fe+ i| fe ) T | + Q, 

Dfc+i = 57 I tut (f 5 

i 

G k = Dfc + i [Pfc+i|fe] 1 , 

m A:|T = m fc|fc + Gfe (mfe+l|T — m fc+l|fc)> 

Pfe|T = Pfe|fe + G k (Pfc + i|T — Pfc + i|fc) Gj. 

(14) 


To evaluate expectations with respect to the pairwise 
smoothing distributions (Eq. [8|, the required 2n-dimensional 
sigma-points need to be generated separately as they are not 
used in the smoother pass. The sigma-points used for the 
pairwise smoothing distributions are of the form 



*-k\T 

Gfc-lPfc|2 


P.itG 1 


k\T^k-l \ 
* k—l\T 


(2n) 


where 22''' 1 are the 2 n-dimensional unit sigma-points. Then, 
expectation a function f(x;.. x/,._i) is approximated as 


(15) 


E(f(x fc ,x fe _i) | y 1:T ) = 2_ w i 


f(x; 


k k 


_l), (16) 
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where w[ 2n ' > are the corresponding weights of the 271- 
dimensional sigma-point scheme. 

III. Parameter Estimation 

In this section, we consider methods for estimating the 
static parameters 9 of the state-space model ([TJ. All methods 
discussed target the maximum likelihood solution, that is, aim 
to maximize p(yi-.T I 0), or equivalently the log-likelihood: 


that is, integrating the measurement model p(yk \ x^, 0) over 
the predicted state distribution p(x.k | yi-k-i,9) which is 
computed during the Bayesian filtering recursion. In assumed 
density Gaussian filtering is used, we get the approximation 

p{yk | yi:fc_i,0) « N(y fc I /r fe ,Sfc), (21) 

whence the marginal log-likelihood expression in Equa¬ 
tion ( [T~9| > evaluates to 


0ml = argmax e logp(yi :T | 9). (17) 

Since the state variables xo -t cannot in general be marginal¬ 
ized in closed-form, approximative numeric methods are 
needed. 

Here, we focus on three approaches where sigma-point 
filtering and smoothing is used to approximate the likeli¬ 
hood. First, we consider a so-called direct-likelihood approach, 
where the sigma-point algorithm is used to directly approx¬ 
imate the log-likelihood and its gradient, which are then 
used in numeric optimization algorithms such as conjugate- 
gradient optimization. Second, the expectation-maximization 
(EM) algorithm which is based on a lower bound for the 
log-likelihood and iterating optimization of parameters with 
respect to the lower bound and updating the lower bound with 
new parameters. The third approach is a modification of the 
direct-likelihood optimization where Fisher’s identity is used 
to express the gradient of the log-likelihood using the same 
lower bound function that appears in the EM algorithm. Note 
that the third approach is otherwise similar to the first, but 
since it is based on the EM lower bound, we present the 
methods in this order. Each of these three approaches may 
be used in combination with any of the sigma-point rules 
discussed in Section Hl-BI 

Note that all the algorithms presented in this section are 
easily extended to maximum a posteriori estimation since 
maximizing the posterior density is equivalent to maximizing 
the (unnormalized) log-posterior. That is, the sum of log- 
likelihood and log-prior: 


C t (9) = logp(yi :T | 9) « - -^log|27rS A 


k =1 

1 T 

- 2 ( yfc - Vk) T S” 1 (y fc - p,k) , (22) 

z fc=i 

where the quantities pk and S k are evaluated during the 
filtering recursion, in the case of sigma-point methods by 
Equation (IT3l >. 

To enable use of gradient-based optimization algorithms, 
we also need a method for evaluating the gradients of the 
marginal log-likelihood. This is based on the so-called sensi¬ 
tivity equations 0, 32] that are obtained by differentiating the 
filtering equations. Namely, the gradient of the log-likelihood 
is obtained by the recursion 


acfc(g) _ acfc-i(e) i 

dOi dO; 2 


S k\0) 


dS k (9) 
d6 t 


- v T k (9)S^(9)^l 
+ ^vJ(O) S l\9) S ~ k \9) v k (9), (23) 


where v fc = y k — ///,.. The derivatives dS gg°^ 


and 


ddi 


computed along the filtering pass by the equations shown in 
Figure [3] 


B. Expectation-Maximization Based Parameter Estimation 


#map = argmax e [logp(yi :T | 9) + logp(0)]. (18) 

Since the log-prior is known, approximations of the unnormal¬ 
ized log-posterior as well as its gradient and lower bounds are 
immediately obtained from the corresponding approximations 
for the log-likelihood. 

A. Direct Likelihood Based Parameter Estimation 

The marginal log-likelihood function can be formulated as 
the following sum 

T 

£t{0) = ^2^ogp{y k | yi:k-i,0). (19) 

k=\ 

Furthermore, the terms of the sum on the right hand side may 
in principle be evaluated by 

p(yk I yi:fc-i,0) = / p{yk I x fe ,0)p(x fc | yi :fe _i,0)dx fe , 

( 20 ) 


Expectation-maximization (EM), proposed by Dempster 
et al. g] is an iterative algorithm for finding maximum 
likelihood parameter estimates in settings with some unob¬ 
served variables, such as the state variables x in the state- 
space context. The motivation is that the so-called full-data 
likelihood of the observed and unobserved variables is easier 
to compute, and a lower bound for the marginal likelihood 
of the observed variables may be obtained based on expected 
full-data log-likelihood. In the following, we present the EM 
algorithm following the formulation by Neal and Hinton |33] 
and the notation of Schon et al. 0. 

The EM algorithm is based on the following lower bound 
of the log-likelihood: 


l0gp(yi:T | 0) > y<?(x 0 :T)log 


P(xQ:T,yi:T I 0 ) 
g(Xo -.t) 


dx. 


0 :T> 


(24) 

where q is an arbitrary probability density over the states x 0 : t- 
The idea is to iteratively maximize this lower bound with 
respect to q (holding 9 fixed) and with respect to 9 (holding 
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dmv. 


-7^7— — T. | w :i F x ( m fc-l|fc-l +Lfc_ 1 |fc_ 1 4 j, 0 ) 


( dm k _i\ k _x 9L fc _ 1 i fe _ 1 

V d 6 i ddi 

df . ,11 

+ 7j07 ( m fc-i|fe-i + Lfc_i £j, 0) >, 


dP 


k\k-l 


ddi 


“E 


Fx + L k _i £j, 0) 


c^m^,_ 1 1 fc_ 1 dP k _ i\k —1 


ddi 


ddi 


, 9 f /„ , T ^ ^ 9m fc|fc-i 

+ ddi ( mfc - 1 l fc - 1 + ^7- 

x [f ( m fe-i|fe-i + Pfc-i|fc-i £j, 0) — m fc |j,_ 1 ] 

+ [f ( m fc-i|fc-i + 1 1 fc—1 $jt ®) — m fc|fc-i] 


Fx (mfc-i + Lj,.! , 0) 

dm k _ 1 \ k _ 1 dP k -i\k-i 


ddi 


ddi 


df . . am fc | fc _ 1 i T 

+ -57- ( m fc-l|fc-l + ") ~ ■ 

Oui 


ddi 


dPk _ 

ah 


H x 4- fj k \k-l 0) 


+ 7^7 ( m fc|fc-l + L fc|fe-1 £j 1 0) 


dv k _ d^ k 

ddi ~ ddi ’ 

as fc 

ddi 


?{• 


H x (mfc|fc-i + L fc i fc _ 1 £j, 0) 


/ ^ m fc|fc-l , 0Tfc|fc-l 

V a^ a^i 

ah . , dfi k 

' -QQ- [ m k\k-l + Ffc|fc_i 0) QgT" 


H ( m fc|fc—1 + Lfcifc—j. , O) ~ Pk 
H ( m fc|fc—1 +Lfc|fc_ 1 $j, 0)-Mfc 

H x (m fe |fc_ 1 + L fc | fc _j £j, 0) 


ah . , a/ifc 

+ ael v m fc|fc— 1 + L fc|fc-i ~ "gg" 


i 


3R 

ddi ’ 


aQl 

aej’ 


^ 1 gj (h(m fe | fc _ 1 + L fc | fc _ 1 £ j ,0) 


q fixed). Furthermore, when 0 = 0( n ) is fixed, the maximum 
with respect to q is obtained by 

<Z(x 0 : t) : = P( X 0:T | yi:T, 0 (n) )- (25) 


By substituting this into Equation (I24l> . the bound becomes 

p(x 0 :T,yi:T | 0) 


' dx 0: 


/p(* Kr |y I3 -,«<">)u.g p(xo!i . |yi!Ti9( „ )) 

= J p(x 0: T | yi:T,0 ( ” ) )logp(x O :T,yi:T | 0) dx 0: T 

- j P(X 0 :T | yi:T.0 ( ” ) )logp(x O: T | yi:T, 0 (n) ) dx 0: T- 


The latter term is independent of 0 and may thus be omitted 
when maximizing the lower bound with respect to 0. The first 
term is the conditional expectation of logp(yi : r, Xo : t | 0) 
conditional on 0dd and yi-.r- Thus, the step of maximizing 
the lower bound (Eq. [241 may be replaced by computing the 
following function: 

Q(0,0 (Tl) ) =E[l 0 gp(XO:T,y 1 :T | 9) | yt:T,0 (n) ]. (26) 

The EM algorithm in its general form thus consists of initial¬ 
izing the parameters to 0(°) and for n = 0,1 ,... iterating the 
following two steps: 

• E-step: compute Q(0,0 (r d). 

• M-step: 0( n+1 ) •<— argmax e Q(0,0( n )). 

In state-space models, the Q-function can be decomposed by 
employing the Markov property of the state sequence and the 
conditional independence of the measurements: 

Q(0,0<")) = 1,(0,0 (n >) +1 2 (0,0(">) +1 3 (0,0< n >), (27) 

where the terms are 


I 1 (0,0 (n) ) =E[logp(xo I 0) I yi:T,0 (n) ], (28) 


I 2 (0,0 (n) ) = £Epogp(x fc I x fc _!,0) I yi:T,0 (n) ], (29) 

k— 1 
T 

1 3 (0, 9 {n) ) = J2 E[l°gp(y* I X fc ,0) I yi:T, 0 (n) ]. (30) 

k= 1 


+ L fc|fc-l 

«( 


Z' <9rnfc|fc-l 

ddi 


H x (m fe | fe _ 1 + L fc | fc _ x £j, 0) 

, 0Lfc|fc-i 

<90; 


dh . e a\ d ^ k 

+ dfi + ~oe~ 


dK k _ dC k 

ddi ~ ddi ° k ° k ° k 

dm k \k _ dm^k-x 


q-i c -i dS k „_i 

ddi k 


ddi 


ddi 


dP k \ k _ dPfc|fc-i 


ddi 


ddi 


8K k dv k 

-v u + JKi -, 

dOi dOi ’ 


dK k 

ddi 


dS k 

ddi 


Ki-K k S, 


dK I 

ddi ' 


Fig. 3. Recursion for computing the derivatives of the prediction and update 
steps. F x is the Jacobian of f (x, G) as a function of x and H x is the Jacobian 
of h(x, 0) as a function of x. Algorithms for computing the derivatives of 
the Cholesky factors L such that P = L L T are omitted here (see d). 


To evaluate this expression, one needs the smoothing distribu¬ 
tions p(xt | yi:r,0 (n) ) an d the joint smoothing distributions 
of consecutive states p(x.k,Xk+i \ yi-.T^^). Sigma-point 
approximations to the EM algorithm are then obtained by 
replacing the expectations over the smoothing distributions 
by their sigma-point smoother approximations. The Gaussian 
smoother approximation for Q is 


Q(0,0 (n) ) 

~ -ylog|27rP 0 | - ^ log | 27 t Q | - ^ log |27r R| 

Po|T + (m 0 | T - m 0 ) (m 0 | T - m 0 ) T | 

1 T 

- yEMQ ^[(Xfc - f(x fc _i))(xfe - f(x fc _i)) T I yi ;T ] } 

Z k= 1 









































SUBMITTED TO JOURNAL OF ADVANCES IN INFORMATION FUSION 


7 


- ^X! tr { R l£ [(y k - h(x fe ))(y fe - h(x fe )) T I yi :T ] }, 

fc=1 

(31) 

where Po,Q,R and the model functions f(-),h(-) depend 
on the parameters 9. The smoothing distribution means and 
covariances t, P/dT are obtained during the smoothing 
backward pass. The expectations over the smoothing distribu¬ 
tion in the latter two terms are evaluated by using the sigma- 
point approximations for Gaussian integrals as follows. The 
second expectation depends only on the smoothing distribution 
N(xfc | nifcjT,Pfc|r) an d is computed as follows: 

E[(y k - h(x fc )(y fc - h(x fc ) T | yi :T ] 


~ Wi ( yfc ~ h ( m fc|T) + Lfc|T&) 
i 

x (y k — h( m fc|T) + Lfc|T^i) T - ( 32 ) 

The first expectation depends on the pairwise joint smoothing 
distribution p(xfc, Xfc_! | yi-.r) ( cf Sec. 1 11 - A I Eq. [8j. Thus, 
to evaluate it we need to use 2n-dimensional sigma-points as 
discussed in Section Hl-CI Equation fl5l 

In general, maximizing Q in the M-step requires the use 
numerical optimization, for example, using the Broyden- 
Fletcher-Goldfarb-Shanno (BFGS) algorithm [34]. However, 
using numerical optimization inside EM is quite cumbersome, 
because with the same effort we could numerically optimize 
the approximate likelihood directly. Hence the benefit of EM 
is in the situation when the optimization can be performed in 
closed form. This kind of special case is the class of models 
where the parameters appear linearly although the model itself 
might be nonlinear. 

In the following, we present closed-form solutions for the 
special case where the model functions are linear combinations 
of the parameters where the parameters appear as coefficients 
of the linear combinations and/or the covariances. That is, we 
consider models that can be represented as follows: 


x k = Affxn) + q fc , 
y k = Hh(x fc ) + r k , 


(33) 

(34) 


where f(-) and h(-) are functions containing the nonlinearities 
and the parameters are a subset of {A, H, Q, R, mo, Po}- 
For these models, the expression for Q can be written as 


Q(0,0 (n) ) = 


- ^ log |2 tt Po| - ^ log |2 tt Q| - ^log|27rR| 

- \ tr {P^ 1 Pq|t + ( m o|T - m o) ( m o|T - m o) T } 


2 tr { Q_1 

{r- 1 


T 

-tr 

2 


£ - C A t - A C T + A $ A t 


D B H t - H B t + H © H t 


}• 


where the model parameters to be optimized are some subset 
of {A, H, Q, R m 0 , P 0 } and £, d>, ©, B, C, D can be eval¬ 
uated based on the latest E-step sigma-point smoother results 


as follows: 

1 T 

^ = 7jt + m fe|T [ m fc|T] T ) (35) 

fc=1 

I T 

$ = ^5lE[f( x fe _ i )f T ( x fc _ 1 ) | yi :T ], (36) 

k= 1 
1 T 

© = 7 p ^E[h(x fc )h T (x fc ) | ynr], (37) 

k=l 
1 T 

B = - ^y fc E[h T (x fe ) | yi ;T ], (38) 

k=1 
1 T 

C = -^E[x fc f T (x fc _ 1 )|y 1:T ], (39) 

k =1 
1 T 

D = ^5Z yfcy fc- (40) 

fc=i 


Using these values, the optimal parameters in the M-step, that 
is, the maximum points of the Q(-, 0 ( n ' ) )-function are 
« When 9 = A, we get 

A* = C4&” 1 . 

. When 9 = H, we get 

HT = B©- 1 . 

. When 9 = Q, we get 

Q* = £ - C A t A C T + A $ A t 
. When 9 = R, we get 

R* = D — H B T — B H t + H © H t . 

• When 9 = mo, we get 

m o = m 0 | T . (41) 

• Finally, the maximum with respect to the initial covari¬ 
ance 9 = P 0 is 

Pq = Po|T + ( m o|T - mo) (m 0 |T - m 0 ) T . 


C. Evaluating the Gradient Based on Fisher’s Identity 


The expected log-likelihood that appears in the EM algo¬ 
rithm may also be used as a basis of an alternative approach 
for evaluating the gradient in direct optimization. Based on 
Fisher’s identity, the gradient of the marginal log-likelihood 
may be expressed as 


dC T (9) 

09 


0Q(9,9^) 

09 


e(»)=e 


(42) 


where Q is the function defined in the EM algorithm 
(Eq. \Ti\ . When the Q-function is approximated with sigma- 
point smoothers, we obtain an alternative approximation of 
the gradient of the marginal log-likelihood that may be used in 
place of the approximation derived in Section ITlI-AI For linear 
state-space models, this approach was suggested by Segal and 
Weinstein fi}] and later by Olsson et al. 01 who called the 
approach the ‘easy gradient recipe’. See ytllOj] for discussions 
of the nonlinear case. 
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IV. Experiments 


In this section, we demonstrate the different sigma-point 
schemes and different parameter estimation algorithms with 
two example models. First, we use a one-dimensional model 
(the univariate nonstationary growth model, UNGM, [35. 36|) 
to illustrate the approximate likelihood curves obtained by 
different methods. Second, we compare the performance of 
different algorithms with simulated data in a problem of track¬ 
ing a maneuvering target with bearings-only measurements. In 
this example, we focus on estimating the sensor variances and 
compare the variance estimates as well as the actual tracking 
error. 


A. Simple Nonlinear Growth Model 

We simulated a realization with T = 100 data from the 
following model: 

x k +i =ax k + b——j + c cos(1.2fc) + q k , (43) 
1 + x % 

y k = dx k +r k . (44) 


with a = 0.5, b = 25, c = 8, d = v/0.05, q k ~ N(0,10), 
r k ~ N(0,0.01), Xq ~ N(0,0.01). This is the univariate 
nonstationary growth model 1351136( 1 except that we changed 
the measurement model to linear as the model with the 
typically used quadratic measurement model is known to be 
challenging for sigma-point algorithms [37]. 

First, we estimated the likelihood of parameter a, holding 
other parameters fixed at their ground-truth values. Likelihood 
curves obtained by direct likelihood estimation with various 
sigma-point rules as well as the EM lower bounds for two 
iterations are shown in Figure [4] For comparison, a likelihood 
estimate obtained by particle filtering (1000 particles and the 
optimal importance distribution) is also shown. The EM itera¬ 
tions seem to converge toward the maximum of the likelihood 
curve and the second EM bound is rather close to the particle 
filter likelihood estimate. The EM lower bounds are mostly 
below the sigma-point likelihood curve as expected, except 
that the first EM lower bound slightly exceeds the sigma- 
point likelihood approximation in the vicinity of the initial 
parameter. However, both the evaluation of Q and the sigma- 
point estimate of the likelihood are approximations. 

Second, to compare the different sigma-point rules, we 
considered estimation of the parameter b with other parameters 
fixed and parameter c with other parameters fixed, using a grid 
of parameter values with close proximity to the maximum 
likelihood values. Namely, for b we used 32 evenly spaced 
points between 21.7698 and 22.5698 and for c we used 32 
points between 7.376 and 8.176. These are shown in Figure^ 
The estimate obtained by the Gauss-Hermite rule and the 
estimates obtained by the higher-order UKFs are rather close 
to each other while the estimate by the 3 rd order UKF is farther 
in both parameters. 


B. Coordinated-Turn Model 

In this section, we compare the performance of the pa¬ 
rameter estimation methods discussed in this article using 



Fig. 4. Visualization of the one-step evolution of the EM algorithm for the 
univariate estimation of parameter a. The dotted line represents the particle 
filter log-likelihood estimate, while the solid line is the sigma-point filter log- 
likelihood approximation. The dashed lines correspond to the sigma-point EM 
bounds for iterations n and n + 1. 



- UKF 3 - UKF 5 - UKF 7 -- UKF 9-GHKF 16 


Fig. 5. Log-likelihood curves for parameters b and c evaluated by five different 
sigma-point methods. The vertical line indicates the location of the maximum. 


a more practical example. The problem is tracking a tar¬ 
get maneuvering according to the coordinated turn model 
M 27 . 3S[ 39 1 with bearings-only sensor measurements. The 


state is 5-dimensional: 


= {x\ x 2 Xi ±2 w) T , 


(45) 


where (x\,X 2 ) is the location of the target in 2-dimensional 
Cartesian coordinates, (xi, ± 2 ) is the corresponding speed, and 


w is the turn rate. 

The dynamic 

model is 




(l 

0 

sinAt) 

cos(cjfc At) —1 

o\ 



0 

1 

cos(cJk At) —1 

sin(o;fc At) 

0 


1 — 
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0 

cos (a;*; At) 

— sin(wfc At) 

0 

Xfc+q fc . 
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0 

sin(wfc At) 

cos(wfc At) 
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(46) 
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The process noise is qj, 

(q c At 3 /3 
0 

q c At 2 /2 
0 
0 


N(0, Q), where 


Q = 


V 


q c At 2 /2 0 0 \ 

0 q c At 2 /2 0 

q c At 0 0 

0 q c At 0 
0 0 qujAtl 

(47) 

The measurements are angles from the sensors with additive 
Gaussian noise: 

yk = h(x*;) + rfc, (48) 


0 

q c At 3 /3 
0 

q c At 2 /2 
0 


where the measurement noise is r;,. ~ N(0, R). The covariance 
matrix R is naturally assumed diagonal, as the measurement 
errors of separate sensors should be independent. For each 
sensor i at location s, the measurement is given by 


fti(xfc) = atan2(tc 2 ,fc - s 2 ,j, %i,k - «i,i) , (49) 


where atan2 is the four-quadrant inverse tangent. We focus 
on estimating the measurement noise variances while keeping 
other parameters fixed. That is, the sensor locations and 
dynamic model covariance are assumed to be known and the 
initial state distribution fixed. 

The parameters of the process noise covariance were set 
to q c = 0.1, q u = 0.1 and the time step to At = 
0.01. The ground-truth measurement noise covariance was 
R = diag(0.05 2 , 0.1 2 ). The two sensors were located at 
Si = (—1,0.5) and s 2 = (1,1). The parameters of the 
initial distribution were mo = (2,0,0,0,0) T and Pq = 
diag(0.5 2 0.5 2 , 0.5 2 , 0.5 2 , l 2 ). We simulated 100 different 
trajectories with T = 50 timesteps from this model. 

To compare performance of the different sigma-point 
schemes, we performed direct maximum likelihood estimation 
of the sensor noise standard deviation of the first sensor, 
keeping the noise of the second sensor as well as other 
parameters fixed at their ground truth values. The sigma-point 
schemes used were UKF 3, UKF 5, and UKF 7, as well as 
GHKF 3, GHKF 5, and GHKF 7. The 9 th order schemes 
were omitted since the number of sigma-points is already quite 
high as the state is 5-dimensional. In addition to the sigma- 
point methods, we also compared against maximum likelihood 
estimation based on the extended Kalman filter (EKF, see, e.g. 
0 ). 

The optimization was performed with gradient-based opti¬ 
mization using the Matlab optimization too I bo xQ Furthermore, 
we investigated how the estimation performance varies as a 
function of uncertainty of the target’s initial location. This was 
done by using an additional parameter for the per-coordinate 
standard deviation (a G (0,0.5]). The first two diagonal 
components of Po were set to a 2 . Furthermore, the first two 
components of mo were interpolated between the original mo 
and the simulated Xo to keep the uncertainty of the initial 
location consistent with the prior. 

Since GHKF 7 is the highest-older sigma-point scheme 
amongst those used in this experiment, we assume it is 
the most accurate and compare against it. Figure [6] shows 


'MATLAB version R2014b, the fminunc function, quasi-Newton algo¬ 
rithm, initialized with yTtTTT = 0.1. 


: ^ 
1 5 


nl 



Fig. 6. Median absolute error of the parameter estimates compared to GHKF 7 
(median taken over the 100 simulated trajectories) as a function of the initial 
location prior standard deviation. UKF 5 is essentially indistinguishable from 
GHKF 3. 


comparison of median (over the 100 trajectories) absolute 
deviation of the MLE estimates obtained by the various 
filtering schemes compared to the ones obtained by GHKF 7. 
GHKF 5 is closest, while EKF and UKF 3 are farthest from 
the baseline. UKF 7, GHKF 3 and UKF 5 have similar 
performance. UKF 5 and GHKF 3 are essentially identical. 
This is explained by the observation that in 5 dimensions, 
all UKF 5 sigma-points are present in the GHKF 3 sigma- 
point set and the sum of the GHKF 3 weights of these points 
is 0.79. The contribution of the remaining sigma-points that 
have total 0.21 weight apparently has a negligible contribution 
at least with this model. In addition, we also look at track 
estimation errors using the final parameter estimates by each 
sigma-point scheme. Figure [7] shows the mean RMSE over 
the 100 trajectories, that is, for each simulated trajectory, we 
computed the smoother RMSE and then took the average. 

To compare the two different gradient evaluation ap¬ 
proaches, sensitivity equations (Section IlII-Ab and the Fisher 
identity approach (Section Illl-Ct . we evaluated the derivative 
of the log-likelihood with respect to the standard deviation of 
the error of the first sensor, using UKF 3 and UKF 5 and 
both gradient evaluation approaches. The results are shown 
in Figure [8] With UKF 3, there is clear difference between 
the estimated gradients while with UKF 5 the approaches 
essentially agree. 

To measure the performance as a function of computational 
cost, we recorded the parameter values as well as the times 
used at each iteration of the optimization routines. In this 
experiment, the initial location standard deviation per coor¬ 
dinate was set to 0.5. Median absolute error (compared to 
final GHKF 7 estimate, as a function of time) is shown in 
Figures |9] and [TO] As one would expect, the higher-order 
schemes are more computationally demanding and GHKF is 
more computationally demanding than UKF, but eventually the 
higher-order GHKF schemes find better parameter estimates. 

Figure [TT] shows the evolution of EM parameter estimation 
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Fig. 7. Mean RMSE of smoothed location (mean taken over the 100 
trajectories) using the MLE estimated noise variance of the first sensor. UKF 5 
is essentially indistinguishable from GHKF 3. 


Fig. 9. Median absolute error of the parameter as a function of computation 
time during the optimization. Median taken over 100 datasets. 



Fig. 8. Derivative of the log-likelihood with respect to the standard deviation 
of the error of the first sensor. Evaluated using UKF 3 and UKF 5 both with 
the sensitivity equation approach (Section fill- At and the Fisher identity based 
approach (Section mm . 


Fig. 10. Median absolute error of the parameter as a function of computation 
time during the optimization. Median taken over 100 datasets. The solid lines 
are gradient-based direct optimization, while the dashed lines show EM (run 
for 32 iterations) with the corresponding sigma-point schemes. 


for one simulated trajectory with a = 0.5. The EM algorithm 
practically converges in a couple of steps with all three 
sigma-point schemes, and the final parameter estimates are 
rather close to each other and to the direct MLE estimates. 
Theoretically, the EM algorithm has linear convergence js[| 
although it is hard to say whether these convergence results 
extend to the case where the E-step is approximated using 
sigma-point smoothers. 

V. Conclusion and Discussion 

In this paper together with the complementing conference 
article jl}], we have considered various probabilistic point 
estimation approaches for parameter estimation in nonlinear 
system identification. We discussed direct likelihood maxi¬ 
mization as well as the expectation-maximization (EM) algo¬ 
rithm coupled with various filtering and smoothing algorithms, 
namely, sigma-point filters, particle filters, and extended 
Kalman filters as well as the corresponding smoothers. In this 


paper, we focused on the differences between different sigma- 
point filters based on unscented transforms of third, fifth, 
seventh and ninth orders, and the Gauss-Hermite cubature 
rules. 

In diminishing order of computational complexity and theo¬ 
retical exactness, the filtering methods would rank as follows: 
particle filter, sigma-point filter, extended Kalman filter based 
direct likelihood approximation. In theory, particle filters con¬ 
verge to the exact filtering solution as the number of particles 
increases, while the other methods considered are based on 
assuming a Gaussian density and using the Kalman filter 
equations. However, especially in high-dimensional cases, the 
computational cost may prohibit the use of particle filters. In 
practice, the assumed density Gaussian filtering approach may 
have satisfactory performance if the nonlinearity is not too 
high. In principle, all sigma-point filters are based on assuming 
Gaussian density, and a higher-order cubature rule should lead 
to more accurate approximation of the Gaussian integrals at the 
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Fig. 11. The plot on the left shows the evolution of the parameter estimate 
during the first 10 EM steps using UKF 3, UKF 5, and UKF 7. All methods 
converge essentially at same speed to the same value. The plot on the right 
shows the evolution of the parameter estimates after the first 10 EM steps as 
well as the corresponding direct MLE estimates (the dashed lines). 


cost of higher computational burden. However, typically in the 
literature ( e.g. 1391) it has been claimed that when the Gaussian 
density approximation is already inaccurate, more accurate 
computation of the Gaussian integrals is not beneficial. 

We also tested the methods in two simulated case studies. 
In the univariate nonstationary growth model, maximum like¬ 
lihood estimates produced by different sigma-point schemes 
were similar. However, the estimates obtained by higher- 
order unscented schemes were closer to the Gauss-Hermite 
(order 16) baseline than the conventional 3 rd order unscented 
transform. This suggests that the higher order methods may 
indeed have some utility. 

In the target tracking experiment, we compared the esti¬ 
mates of the noise standard deviation of one of the sensors 
as a function of prior uncertainty of the target’s location 
using each of the sigma-point schemes. With higher prior 
uncertainty, there were more differences amongst the methods. 
This is reasonable because when there is less uncertainty in 
the model, all the methods obtain more accurate parameter 
estimates. Furthermore, the nonlinearity of the model has a 
stronger effect when the state variance is larger. Since no exact 
maximum likelihood estimate was available, we compared 
to the highest-order sigma-point scheme, namely, GHKF 7. 
Compared to that, the Gauss-Hermite schemes were closer 
than the unscented transform based schemes and higher order 
schemes were closer than lower order schemes. Thus, the 
results are consistent with an assumption that the higher-order 
sigma-point methods produce better approximations to the 
Gaussian filtering result. 

We also measured the performance of the discussed op¬ 
timization routines as a function of computational time. In 
direct gradient-based optimization, higher-degree algorithms 
are more time-consuming but eventually seem to obtain bet¬ 
ter parameter estimates. The EM algorithm with low-degree 
sigma-point schemes (UKF 3, UKF 5) was initially faster than 
the gradient-based optimization, but eventually the gradient- 
based optimization seems to obtain better values. The EM 
algorithm with UKF 7 sigma-points was more computationally 
demanding than the direct optimization with UKF 7. This is 


due to the fact that EM requires the 2n-dimensional sigma- 
points for the smoothing distribution. This suggests that EM 
is not applicable in high-dimensional problems combined with 
high-degree sigma-point schemes. However, when interpreting 
these findings, it should be noted that we compared to the 
GHKF 7 estimate since the true maximum-likelihood estimate 
was not available. 

The sigma-point integration schemes are derived by assum¬ 
ing exact integration results for polynomials of certain degrees. 
Thus, it should be noted that it is not guaranteed that a higher- 
order integration rule produces a more accurate results, even 
though it is accurate for higher-order polynomials. Further¬ 
more, it is not guaranteed that a better approximation to the 
Gaussian filtering result produces a better approximation to 
the exact maximum likelihood result. On the other hand, there 
is no reason why in general a lower-order approximation to 
the Gaussian filtering integrals would produce more accurate 
approximations to the exact filtering results. 

We also compared the actual tracking performance in terms 
of the smoother root mean square errors of the target locations 
using the smoother results obtained with the maximum likeli¬ 
hood parameter estimate of each sigma-point filter. There was 
no clear differences between the different sigma-point schemes 
in terms of the tracking error in this experiment. The tracking 
error of EKF increased more rapidly as a function of the initial 
location uncertainty, which demonstrates the local linearization 
nature of EKF. 

In five dimensions, the UKF 5 sigma-point scheme approxi¬ 
mates the GHKF 3 scheme in the sense that all UKF 5 sigma- 
points are GHKF 3 sigma-points as well, and more than half 
of total weight is contributed by these points. The target¬ 
tracking experiment demonstrated that these two schemes 
indeed produce almost equal results. However, most GHKF 3 
sigma-points are not used in the UKF 5 scheme, and thus 
evaluating an integral by UKF 5 requires considerably fewer 
function evaluations. These results suggest that there is no 
reason to use GHKF 3 in five dimensions as UKF 5 produces 
essentially same results with fewer computations. It is possible 
that similar computationally lighter close approximations exist 
for other Gauss-Hermite based sigma-point schemes as well. 

In the target-tracking experiment, we also investigated how 
the performance of the EM algorithm varies with the sigma- 
point scheme used. As a function of EM iterations, the 
evolution of the parameter estimate was not affected by the 
choice of sigma-point scheme. However, the number of sigma- 
points in the UKF 3 rule is a small fraction of the number 
of sigma-points with the higher order rules. Thus, measured 
by model function evaluations, EM with the UKF 3 rule 
converged faster. This suggests that even when the interest 
lies in obtaining as accurate parameter estimates as possible, 
a reasonable computational approach would be to first use 
EM with a low-order sigma-point scheme, such as UKF 3, to 
obtain a ballpark estimate. Then, if accuracy is desired, the 
initial estimate could be refined using a more accurate sigma- 
point scheme combined with a direct optimization algorithm. 

In this paper, we considered only discrete-time state-space 
models. Different sigma-point schemes may also be used for 
continuous-discrete state-space models (see, e.g., Si and 
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references therein). We considered only fixed deterministic 
sigma-point schemes. An interesting future research topic 
could be to combine the recently proposed filters based on 
adapting or randomizing the sigma-points 141 44311 with pa¬ 
rameter estimation. 

Finally, we attempt to conclude which of the methods 
considered here one should use in practice. Regarding the 
choice of sigma-point methods the higher order unscented 
transform methods turned out to be quite good in the examples 
that we considered—but if the best possible accuracy is 
desired, then Gauss-Hermite methods need to be used. The 
EM algorithm is indeed useful in situations when the M- 
step optimization can be done in closed form—of which 
important special cases are the linear-in-parameters models 
considered here. However, for nonlinear in parameters models 
EM might not be a good choice. For nonlinear in parameters 
models it is thus beneficial to directly optimize the log- 
likelihood, and in that case we have the choice to evaluate 
the gradients either using the sensitivity equations or using 
the Fisher’s identity. It turned out that the Fisher’s identity 
is often computationally more demanding than the sensitivity 
equations, due to the requirement of smoothing pass, which 
favors the use of sensitivity equations for this purpose. Further¬ 
more, the sensitivity equations give the exact gradients of the 
approximate likelihood whereas Fisher’s identity only gives 
approximate gradients of it. However, the Fisher’s identity has 
the advantage of easy black-box implementation which can 
sometimes be seen as an advantage. 

Appendix 

M-step in the linear-in-parameters case 

By substituting the linear-in-parameters model (f(x) := 
Af(x), h(x) := Hh(x)) into the general expression for 
C2(0.9 (")), we obtain 

Q(0,0 (n) ) 

~ ^ log |2 tt P 0 | - ^ log |2 tt Q| - log 12zr R| 

p o|T + (m 0 | T - m 0 ) (m 0 | T - m 0 ) T j 

1 T 

- -^tr{Q _1 E[(x fe - Af(x fc _i)) 


k =1 


x (x fe - Af(x t _i)) T I y 1:T ] } 


1 

- ^X! tr { R ~ lE [( yfe ~ Hll (Xfc)) 


= - — tr {Q 1 ^ ^E[(x fc - Af(xfc_i)) 
k= 1 

\T 


= "tr{Q 1 ^ ~ x fcf( x fc-i) T A T 


- Af(x fe _i)xJ -f A f(xfc_i)f(xfc_i) r A t | yur]}, 


(51) 


which due to linearity of expectation equals 

T 


T 

= -2 tr 


: X I I YET] 


{q 1 f 5Z E [ Xfc: 

k =1 

- J, (5Z E i Xk f( x fe-i) T I yi:r]) A T 

k= 1 
1 T 

-A - ^ E [f(xn)x T I Yi-.t] 

k=l 
1 T 

+ A - £ (E[f(x fc -i) f(x fc _!) T I y 1:T ]) A t ] }. (52) 


fc=l 


Noting that E[x fc xj | yi :T ] = m fc | T m^ |T + P fc | T and 
substituting in the notation introduced in Equations (f35l44Qt . 
we obtain 


T 

=-tr • 

2 


jcr 1 E-CA t -AC t +A$A t }. (53) 

Similar calculation for the last term in Equation (l50l) . noting 
that E[y fc yj | y 1:T ] = y k yj, gives 


T 

-tr 

2 


{ R_1 


D — B H t — H B t 4- H © HT 


]}■ (54) 


Substituting Equations ( |53] > and ( 154b into Equation ( l50l >. we 
get 

Q(e.e^) = 

- l°g I2 tt P o| - 7 j- log |27 t Q| - log |27 t R| 

- | tr |p„ 1 P 0 |T + ( m 0 |T - m o) ( m 0 |T - m o) T } 


T 

- "2 tr 
T 

-tr 

2 


{q- 1 

{b - 1 


S-CA AC 


dbh t -hb t + h© 


T 4- A $ A T j | 
9H T ]}. 


To maximize this with respect to the parameters 
(m 0 ,P 0 , A,H, Q,R) we differentiate with respect to param¬ 
eter in question and set the derivative to 0. For Q: 

-3§ '~fd5 logl121(31 

T d r 

-tr \ Q 1 

2 dO 


k=1 

x (y fc - Hh(x fc )) T | yi :T ]}. (50) 

Since trace is linear, the penultimate term can be written as 

T 

2 ” L-v j, 

k- 

x (Xfc - Af(x fc _i)) T |yi:r]} 


2 dQ 
+ ?Q- 


S - C A t - A C T 4- A $ A t 


]} 


E - C A t A C T 4- A $ A t 


Q 1 


(55) 


Setting the derivative equal to 0, we obtain the equation 




E — C A t -AC t + A$ A t 


Q 


(56) 


*;=i 


Multiplying from right by y Q and from left by Q gives 
Q = E - C A t - A C T + A $ A t . 


(57) 
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The derivations for the optimal solutions of R and Po are 
similar. For A: 


dQ 

dA 


f[-d^tr( Q -CA T ) 

^ tr(Q- 1 A C T ) + A tr(A A T )] 

| Q' 1 [2 A 3> - 2C] (58) 


Since Q 1 is nonsingular, the derivative is zero only if the 
last factor is zero. If 3? is invertible, this in turn implies 


A = C^- 1 . 


(59) 


The derivations for the optimal solutions of H and mo are 
similar. 

If the parameter 6 is any subset of {A, H, Q, R, mo, Po}, 
it can be optimized by these closed-form expressions. First, 
note that (A, Q), (H, R) and (mo, Po) are independent in the 
sense that, for example, the optimal A and Q do not depend 
on the other four parameters. Furthermore, the optimal A does 
not depend on Q. Thus, A and Q can be jointly optimized by 
first solving the optimal A and then substituting that into the 
expression of optimal Q. Similar reasoning works for (H, R) 
and (m 0 ,Po). 
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